Lasso Regression

In statistics and machine learning, lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. It was originally introduced in geophysics, and later by Robert Tibshirani, who coined the term.

Lasso was originally formulated for linear regression models. This simple case reveals a substantial amount about the estimator, including its relationship to ridge regression and best subset selection and the connection between lasso coefficient estimates and so-called soft thresholding. It also reveals that (like standard linear regression) the coefficient estimates need not be unique if covariates are collinear.

Though originally defined for linear regression, lasso regularization is easily extended to other statistical models including generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators. Lasso's ability to perform subset selection relies on the form of the constraint and has a variety of interpretations, including in terms of geometry, Bayesian statistics and convex analysis.

The lasso is closely related to basis pursuit denoising.


History

Lasso was introduced in order to improve the prediction accuracy and interpretability of regression models. It selects a reduced set of the known covariates for use in a model. It was developed independently in the geophysics literature in 1986, based on prior work that used the \ell^1 penalty for both fitting and penalization of the coefficients. Statistician Robert Tibshirani independently rediscovered and popularized it in 1996, based on Breiman's nonnegative garrote.

Prior to lasso, the most widely used method for choosing covariates was stepwise selection. That approach only improves prediction accuracy in certain cases, such as when only a few covariates have a strong relationship with the outcome; in other cases, it can increase prediction error. At the time, ridge regression was the most popular technique for improving prediction accuracy. Ridge regression improves prediction error by shrinking the sum of the squares of the regression coefficients to be less than a fixed value in order to reduce overfitting, but it does not perform covariate selection and therefore does not help to make the model more interpretable.

Lasso achieves both of these goals by forcing the sum of the absolute values of the regression coefficients to be less than a fixed value, which forces certain coefficients to zero, excluding them from impacting prediction. This idea is similar to ridge regression, which also shrinks the size of the coefficients; however, ridge regression does not set coefficients to zero (and, thus, does not perform variable selection).


Basic form


Least squares

Consider a sample consisting of ''N'' cases, each of which consists of ''p'' covariates and a single outcome. Let y_i be the outcome and x_i := ( x_{i1}, x_{i2}, \ldots, x_{ip} )^T be the covariate vector for the ''i'' th case. Then the objective of lasso is to solve

: \min_{\beta_0,\beta} \left\{ \frac{1}{N} \sum_{i=1}^N ( y_i - \beta_0 - x_i^T \beta )^2 \right\} \text{ subject to } \sum_{j=1}^p | \beta_j | \leq t.

Here \beta_0 is the constant coefficient, \beta := ( \beta_1, \beta_2, \ldots, \beta_p ) is the coefficient vector, and t is a prespecified free parameter that determines the degree of regularization.

Letting X be the covariate matrix, so that X_{ij} = ( x_i )_j and x_i^T is the ''i'' th row of X, the expression can be written more compactly as

: \min_{\beta_0,\beta} \left\{ \frac{1}{N} \left\| y - \beta_0 \mathbf{1}_N - X \beta \right\|_2^2 \right\} \text{ subject to } \| \beta \|_1 \leq t,

where \| u \|_p = \left( \sum_{i=1}^N | u_i |^p \right)^{1/p} is the standard \ell^p norm.

Denoting the mean of the data points x_i by \bar{x} and the mean of the response variables y_i by \bar{y}, the resulting estimate for \beta_0 is \hat{\beta}_0 = \bar{y} - \bar{x}^T \beta, so that

: y_i - \hat{\beta}_0 - x_i^T \beta = y_i - ( \bar{y} - \bar{x}^T \beta ) - x_i^T \beta = ( y_i - \bar{y} ) - ( x_i - \bar{x} )^T \beta,

and therefore it is standard to work with variables that have been made zero-mean. Additionally, the covariates are typically standardized \textstyle \left( \sum_{i=1}^N x_{ij}^2 = 1 \right) so that the solution does not depend on the measurement scale.

It can be helpful to rewrite

: \min_{\beta_0,\beta} \left\{ \frac{1}{N} \left\| y - \beta_0 \mathbf{1}_N - X \beta \right\|_2^2 \right\} \text{ subject to } \| \beta \|_1 \leq t

in the so-called Lagrangian form

: \min_{\beta_0,\beta} \left\{ \frac{1}{N} \left\| y - \beta_0 \mathbf{1}_N - X \beta \right\|_2^2 + \lambda \| \beta \|_1 \right\},

where the exact relationship between t and \lambda is data dependent.
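For illustration, the following is a minimal sketch of fitting the Lagrangian-form lasso with scikit-learn's Lasso estimator. The synthetic data, the chosen value of \lambda, and the mapping between \lambda and scikit-learn's alpha parameter (its objective is (1/(2N))\|y - X\beta\|_2^2 + \alpha\|\beta\|_1, so alpha = \lambda / 2 relative to the (1/N)-scaled objective above) are assumptions of this example, not part of the original formulation.

```python
# A minimal sketch of fitting the Lagrangian-form lasso with scikit-learn.
# The synthetic data and parameter values are illustrative only.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 100, 10
X = rng.standard_normal((N, p))
beta_true = np.array([3.0, -2.0, 0, 0, 1.5, 0, 0, 0, 0, 0])
y = X @ beta_true + rng.standard_normal(N)

lam = 0.2  # the regularization weight lambda in the Lagrangian form above
# scikit-learn's Lasso minimizes (1/(2N))||y - X b||^2 + alpha ||b||_1,
# so alpha = lambda / 2 matches the (1/N)-scaled objective in the text.
model = Lasso(alpha=lam / 2, fit_intercept=True)
model.fit(X, y)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)   # several entries are exactly zero
```

Increasing lam shrinks the coefficients further and drives more of them exactly to zero.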


Orthonormal covariates

Some basic properties of the lasso estimator can now be considered.

Assuming first that the covariates are orthonormal so that x_i^T x_j = \delta_{ij}, where \delta_{ij} is the Kronecker delta, or, equivalently, X^T X = I, then using subgradient methods it can be shown that

: \hat{\beta}_j = S_{N\lambda}( \hat{\beta}^\text{OLS}_j ) = \hat{\beta}^\text{OLS}_j \max \left( 0, 1 - \frac{ N\lambda }{ | \hat{\beta}^\text{OLS}_j | } \right), \qquad \text{where } \hat{\beta}^\text{OLS} = ( X^T X )^{-1} X^T y = X^T y.

S_\alpha is referred to as the ''soft thresholding operator'', since it translates values towards zero (making them exactly zero if they are small enough) instead of setting smaller values to zero and leaving larger ones untouched as the ''hard thresholding operator'', often denoted H_\alpha, would.

In ridge regression the objective is to minimize

: \min_\beta \left\{ \frac{1}{N} \| y - X \beta \|_2^2 + \lambda \| \beta \|_2^2 \right\}.

Using X^T X = I and the ridge regression formula \hat{\beta} = \left( X^T X + N \lambda I \right)^{-1} X^T y, this yields

: \hat{\beta}_j = ( 1 + N \lambda )^{-1} \hat{\beta}^\text{OLS}_j.

Ridge regression shrinks all coefficients by a uniform factor of ( 1 + N \lambda )^{-1} and does not set any coefficients to zero.

Lasso can also be compared to regression with best subset selection, in which the goal is to minimize

: \min_\beta \left\{ \frac{1}{N} \| y - X \beta \|_2^2 + \lambda \| \beta \|_0 \right\},

where \| \cdot \|_0 is the " \ell^0 norm", which is defined as \| z \|_0 = m if exactly m components of z are nonzero. In this case, it can be shown that

: \hat{\beta}_j = H_{\sqrt{N\lambda}} \left( \hat{\beta}^\text{OLS}_j \right) = \hat{\beta}^\text{OLS}_j \, \mathrm{I} \left( \left| \hat{\beta}^\text{OLS}_j \right| \geq \sqrt{N\lambda} \right),

where H_\alpha is the so-called hard thresholding function and \mathrm{I} is an indicator function (it is 1 if its argument is true and 0 otherwise).

Therefore, the lasso estimates share features of both ridge and best subset selection regression: they shrink the magnitude of all the coefficients, like ridge regression, and set some of them to zero, as in the best subset selection case. Additionally, while ridge regression scales all of the coefficients by a constant factor, lasso instead translates the coefficients towards zero by a constant value and sets them to zero if they reach it.
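The following small numerical sketch compares the three estimators under an orthonormal design, where each one is a simple transform of the OLS coefficients. The coefficient values and threshold are assumed for the example.

```python
# Soft thresholding (lasso), uniform shrinkage (ridge) and hard thresholding
# (best subset selection) applied to OLS coefficients under an orthonormal design.
import numpy as np

def soft_threshold(b_ols, thr):
    """Lasso under orthonormal covariates: translate towards zero, clip at zero."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - thr, 0.0)

def ridge_shrink(b_ols, n_lambda):
    """Ridge under orthonormal covariates: uniform multiplicative shrinkage."""
    return b_ols / (1.0 + n_lambda)

def hard_threshold(b_ols, thr):
    """Best subset under orthonormal covariates: keep or kill, no shrinkage."""
    return np.where(np.abs(b_ols) >= thr, b_ols, 0.0)

b_ols = np.array([3.0, 1.2, -0.4, 0.05])
n_lambda = 0.5  # N * lambda in the formulas above

print(soft_threshold(b_ols, n_lambda))            # [ 2.5  0.7 -0.   0. ]
print(ridge_shrink(b_ols, n_lambda))              # all shrunk by 1/1.5, none zero
print(hard_threshold(b_ols, np.sqrt(n_lambda)))   # small entries zeroed, large ones kept unshrunk
```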


Correlated covariates

In one special case two covariates, say ''j'' and ''k'', are identical for each observation, so that x_{ij} = x_{ik} for every i. Then the values of \beta_j and \beta_k that minimize the lasso objective function are not uniquely determined. In fact, if there is some solution \hat{\beta} in which \hat{\beta}_j \hat{\beta}_k \geq 0, then for any s \in [0,1], replacing \hat{\beta}_j by s ( \hat{\beta}_j + \hat{\beta}_k ) and \hat{\beta}_k by ( 1 - s ) ( \hat{\beta}_j + \hat{\beta}_k ), while keeping all the other \hat{\beta}_i fixed, gives a new solution, so the lasso objective function has a continuum of valid minimizers. Several variants of the lasso, including elastic net regularization, have been designed to address this shortcoming.


General form

Lasso regularization can be extended to other objective functions such as those for generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators. Given the objective function

: \frac{1}{N} \sum_{i=1}^N f( x_i, y_i, \alpha, \beta ),

the lasso regularized version of the estimator is the solution to

: \min_{\alpha, \beta} \frac{1}{N} \sum_{i=1}^N f( x_i, y_i, \alpha, \beta ) \text{ subject to } \| \beta \|_1 \leq t,

where only \beta is penalized while \alpha is free to take any allowed value, just as \beta_0 was not penalized in the basic case.
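As a brief illustration of the general form, the sketch below applies an \ell^1 penalty to a generalized linear model, namely logistic regression, using scikit-learn. The data, the penalty strength, and the choice of solver are assumptions of the example.

```python
# A minimal sketch of lasso regularization applied to a generalized linear model:
# L1-penalized logistic regression via scikit-learn. Illustrative data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
N, p = 200, 15
X = rng.standard_normal((N, p))
logits = 2.0 * X[:, 0] - 1.5 * X[:, 3]          # only two covariates matter
y = (rng.random(N) < 1 / (1 + np.exp(-logits))).astype(int)

# C is the inverse of the regularization strength; smaller C = stronger L1 penalty.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.3)
clf.fit(X, y)
print(clf.coef_)   # most coefficients are driven exactly to zero
```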


Interpretations


Geometric interpretation

Lasso can set coefficients to zero, while the superficially similar ridge regression cannot. This is due to the difference in the shape of their constraint boundaries. Both lasso and ridge regression can be interpreted as minimizing the same objective function

: \min_{\beta_0,\beta} \left\{ \frac{1}{N} \left\| y - \beta_0 \mathbf{1}_N - X \beta \right\|_2^2 \right\}

but with respect to different constraints: \| \beta \|_1 \leq t for lasso and \| \beta \|_2^2 \leq t for ridge. The constraint region defined by the \ell^1 norm is a square rotated so that its corners lie on the axes (in general a cross-polytope), while the region defined by the \ell^2 norm is a circle (in general an ''n''-sphere), which is rotationally invariant and, therefore, has no corners. A convex object that lies tangent to the boundary, such as a line, is likely to encounter a corner (or a higher-dimensional equivalent) of the cross-polytope, for which some components of \beta are identically zero, while in the case of an ''n''-sphere, the points on the boundary for which some of the components of \beta are zero are not distinguished from the others, and the convex object is no more likely to contact a point at which some components of \beta are zero than one for which none of them are.


Making λ easier to interpret with an accuracy-simplicity tradeoff

The lasso can be rescaled so that it becomes easy to anticipate and influence the degree of shrinkage associated with a given value of \lambda. It is assumed that X is standardized with z-scores and that y is centered (zero mean). Let \beta_0 represent the hypothesized regression coefficients and let b_\text{OLS} refer to the data-optimized ordinary least squares solutions. We can then define the Lagrangian as a tradeoff between the in-sample accuracy of the data-optimized solutions and the simplicity of sticking to the hypothesized values. This results in

: \min_\beta \left\{ \frac{ ( y - X \beta )' ( y - X \beta ) }{ ( y - X \beta_0 )' ( y - X \beta_0 ) } + 2 \lambda \sum_{i=1}^p \frac{ | \beta_i - \beta_{0,i} | }{ q_i } \right\},

where q_i is specified below. The first fraction represents relative accuracy, the second fraction relative simplicity, and \lambda balances between the two.

Given a single regressor, relative simplicity can be defined by specifying q_i as | b_{\text{OLS},i} - \beta_{0,i} |, which is the maximum amount of deviation from \beta_0 when \lambda = 0. Assuming that \beta_0 = 0, the solution path can be defined in terms of R^2:

: b_{\ell_1} = \begin{cases} ( 1 - \lambda / R^2 ) \, b_\text{OLS} & \mbox{if } \lambda \leq R^2, \\ 0 & \mbox{if } \lambda > R^2. \end{cases}

If \lambda = 0, the ordinary least squares (OLS) solution is used. The hypothesized value \beta_0 = 0 is selected if \lambda is bigger than R^2. Furthermore, if R^2 = 1, then \lambda represents the proportional influence of \beta_0 = 0. In other words, \lambda \times 100\% measures in percentage terms the minimal amount of influence of the hypothesized value relative to the data-optimized OLS solution.

If an \ell_2-norm is used to penalize deviations from zero given a single regressor, the solution path is given by

: b_{\ell_2} = \left( 1 + \frac{\lambda}{R^2} \right)^{-1} b_\text{OLS}.

Like b_{\ell_1}, b_{\ell_2} moves in the direction of the point (\lambda = R^2, b = 0) when \lambda is close to zero; but unlike b_{\ell_1}, the influence of R^2 diminishes in b_{\ell_2} if \lambda increases.
Given multiple regressors, the moment that a parameter is activated (i.e. allowed to deviate from \beta_0) is also determined by a regressor's contribution to R^2 accuracy. First,

: R^2 = 1 - \frac{ ( y - X b_\text{OLS} )' ( y - X b_\text{OLS} ) }{ ( y - X \beta_0 )' ( y - X \beta_0 ) }.

An R^2 of 75% means that in-sample accuracy improves by 75% if the unrestricted OLS solutions are used instead of the hypothesized \beta_0 values. The individual contribution of deviating from each hypothesis can be computed with the p \times p matrix

: R^\otimes = ( X' \tilde{y}_0 )( X' \tilde{y}_0 )' ( X' X )^{-1} ( \tilde{y}_0' \tilde{y}_0 )^{-1},

where \tilde{y}_0 = y - X \beta_0. If b = b_\text{OLS} when R^2 is computed, then the diagonal elements of R^\otimes sum to R^2. The diagonal R^\otimes values may be smaller than 0 or, less often, larger than 1. If regressors are uncorrelated, then the i^\text{th} diagonal element of R^\otimes simply corresponds to the r^2 value between x_i and y.

A rescaled version of the adaptive lasso can be obtained by setting q_{\text{adaptive lasso},i} = | b_{\text{OLS},i} - \beta_{0,i} |. If regressors are uncorrelated, the moment that the i^\text{th} parameter is activated is given by the i^\text{th} diagonal element of R^\otimes. Assuming for convenience that \beta_0 is a vector of zeros,

: b_{\ell_1,i} = \begin{cases} ( 1 - \lambda / R^\otimes_{ii} ) \, b_{\text{OLS},i} & \mbox{if } \lambda \leq R^\otimes_{ii}, \\ 0 & \mbox{if } \lambda > R^\otimes_{ii}. \end{cases}

That is, if regressors are uncorrelated, \lambda again specifies the minimal influence of \beta_0. Even when regressors are correlated, the first time that a regression parameter is activated occurs when \lambda is equal to the highest diagonal element of R^\otimes.

These results can be compared to a rescaled version of the lasso by defining q_{\text{lasso},i} = \frac{1}{p} \sum_{j=1}^p | b_{\text{OLS},j} - \beta_{0,j} |, which is the average absolute deviation of b_\text{OLS} from \beta_0. Assuming that regressors are uncorrelated, the moment of activation of the i^\text{th} regressor is given by

: \tilde{\lambda}_i = \frac{1}{p} \sqrt{ R^\otimes_{ii} } \sum_{j=1}^p \sqrt{ R^\otimes_{jj} }.

For p = 1, the moment of activation is again given by \tilde{\lambda}_1 = R^2. If \beta_0 is a vector of zeros and a subset of p_B relevant parameters are equally responsible for a perfect fit of R^2 = 1, then this subset is activated at a \lambda value of \frac{1}{p_B} under the adaptive lasso, whereas the moment of activation of a relevant regressor under the rescaled lasso equals \frac{1}{p} \frac{1}{\sqrt{p_B}} \, p_B \frac{1}{\sqrt{p_B}} = \frac{1}{p}. In other words, the inclusion of irrelevant regressors delays the moment that relevant regressors are activated by this rescaled lasso. The adaptive lasso and the lasso are special cases of a '1ASTc' estimator. The latter only groups parameters together if the absolute correlation among regressors is larger than a user-specified value.
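The single-regressor case above admits a very short numerical check. The sketch below, under the stated assumptions (one z-scored regressor, centered response, hypothesized \beta_0 = 0), traces the rescaled solution path b = \max(0, 1 - \lambda/R^2)\, b_\text{OLS}; the data and the grid of \lambda values are illustrative only.

```python
# A small numerical sketch of the rescaled lasso path for a single standardized
# regressor with centered y and hypothesized beta_0 = 0. Illustrative data only.
import numpy as np

rng = np.random.default_rng(2)
N = 500
x = rng.standard_normal(N)
x = (x - x.mean()) / x.std()              # z-scored regressor
y = 0.8 * x + rng.standard_normal(N)
y = y - y.mean()                          # centered response

b_ols = (x @ y) / (x @ x)                 # OLS slope
r2 = 1 - np.sum((y - x * b_ols) ** 2) / np.sum(y ** 2)

for lam in (0.0, 0.1, 0.3, r2, 0.9):
    b = max(0.0, 1 - lam / r2) * b_ols    # rescaled lasso path; exactly 0 once lam >= R^2
    print(f"lambda = {lam:.2f}  ->  b = {b:.3f}")
```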


Bayesian interpretation

Just as ridge regression can be interpreted as linear regression for which the coefficients have been assigned normal prior distributions, lasso can be interpreted as linear regression for which the coefficients have Laplace prior distributions. The Laplace distribution is sharply peaked at zero (its first derivative is discontinuous at zero) and it concentrates its probability mass closer to zero than does the normal distribution. This provides an alternative explanation of why lasso tends to set some coefficients to zero, while ridge regression does not.


Convex relaxation interpretation

Lasso can also be viewed as a convex relaxation of the best subset selection regression problem, which is to find the subset of at most k covariates that results in the smallest value of the objective function for some fixed k \leq n, where n is the total number of covariates. The " \ell^0 norm", \| \cdot \|_0 (the number of nonzero entries of a vector), is the limiting case of " \ell^p norms" of the form \textstyle \| x \|_p = \left( \sum_{j=1}^n | x_j |^p \right)^{1/p} (where the quotation marks signify that these are not really norms for p < 1, since \| \cdot \|_p is not convex for p < 1, so the triangle inequality does not hold). Therefore, since p = 1 is the smallest value for which the " \ell^p norm" is convex (and therefore actually a norm), lasso is, in some sense, the best convex approximation to the best subset selection problem, since the region defined by \| x \|_1 \leq t is the convex hull of the region defined by \| x \|_p \leq t for p < 1.


Generalizations

Lasso variants have been created in order to remedy limitations of the original technique and to make the method more useful for particular problems. Almost all of these focus on respecting or exploiting dependencies among the covariates.

Elastic net regularization adds an additional ridge regression-like penalty that improves performance when the number of predictors is larger than the sample size, allows the method to select strongly correlated variables together, and improves overall prediction accuracy.

Group lasso allows groups of related covariates to be selected as a single unit, which can be useful in settings where it does not make sense to include some covariates without others. Further extensions of group lasso perform variable selection within individual groups (sparse group lasso) and allow overlap between groups (overlap group lasso) (Puig, Arnau Tibau, Ami Wiesel, and Alfred O. Hero III, "A Multidimensional Shrinkage-Thresholding Operator", Proceedings of the 15th Workshop on Statistical Signal Processing, SSP'09, IEEE, pp. 113–116; Jacob, Laurent, Guillaume Obozinski, and Jean-Philippe Vert, "Group Lasso with Overlap and Graph LASSO", Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009).

Fused lasso can account for the spatial or temporal characteristics of a problem, resulting in estimates that better match system structure (Tibshirani, Robert, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight, "Sparsity and Smoothness via the Fused Lasso", Journal of the Royal Statistical Society, Series B 67 (1), 2005: 91–108, https://www.jstor.org/stable/3647602).

Lasso-regularized models can be fit using techniques including subgradient methods, least-angle regression (LARS), and proximal gradient methods. Determining the optimal value for the regularization parameter is an important part of ensuring that the model performs well; it is typically chosen using cross-validation.


Elastic net

In 2005, Zou and Hastie introduced the elastic net. When ''p'' > ''n'' (the number of covariates is greater than the sample size) lasso can select only ''n'' covariates (even when more are associated with the outcome) and it tends to select one covariate from any set of highly correlated covariates. Additionally, even when ''n'' > ''p'', ridge regression tends to perform better given strongly correlated covariates.

The elastic net extends lasso by adding an additional \ell^2 penalty term giving

: \min_\beta \left\{ \| y - X \beta \|_2^2 + \lambda_1 \| \beta \|_1 + \lambda_2 \| \beta \|_2^2 \right\},

which is equivalent to solving

: \min_\beta \left\{ \| y - X \beta \|_2^2 \right\} \text{ subject to } ( 1 - \alpha ) \| \beta \|_1 + \alpha \| \beta \|_2^2 \leq t, \qquad \text{where } \alpha = \frac{\lambda_2}{\lambda_1 + \lambda_2}.

This problem can be written in a simple lasso form

: \min_{\beta^*} \left\{ \| y^* - X^* \beta^* \|_2^2 + \lambda^* \| \beta^* \|_1 \right\}

letting

: X_{(N+p) \times p}^* = ( 1 + \lambda_2 )^{-1/2} \binom{X}{\sqrt{\lambda_2} \, I_{p \times p}}, \qquad y_{(N+p)}^* = \binom{y}{0^p}, \qquad \lambda^* = \frac{\lambda_1}{\sqrt{1 + \lambda_2}}, \qquad \beta^* = \sqrt{1 + \lambda_2} \, \beta.

Then \hat{\beta} = \frac{\hat{\beta}^*}{\sqrt{1 + \lambda_2}}, which, when the covariates are orthogonal to each other, gives

: \hat{\beta}_j = \frac{ \hat{\beta}^{*,\text{OLS}}_j }{ \sqrt{1 + \lambda_2} } \max \left( 0, 1 - \frac{ \lambda^* }{ | \hat{\beta}^{*,\text{OLS}}_j | } \right) = \frac{ \hat{\beta}^\text{OLS}_j }{ 1 + \lambda_2 } \max \left( 0, 1 - \frac{ \lambda_1 }{ | \hat{\beta}^\text{OLS}_j | } \right) = ( 1 + \lambda_2 )^{-1} \hat{\beta}^\text{lasso}_j.

So the result of the elastic net penalty is a combination of the effects of the lasso and ridge penalties.

Returning to the general case, the fact that the penalty function is now strictly convex means that if x_j = x_k, then \hat{\beta}_j = \hat{\beta}_k, which is a change from lasso. In general, if \hat{\beta}_j \hat{\beta}_k > 0, then

: \frac{ | \hat{\beta}_j - \hat{\beta}_k | }{ \| y \|_1 } \leq \lambda_2^{-1} \sqrt{ 2 ( 1 - \rho_{jk} ) }, \qquad \text{where } \rho = X^T X

is the sample correlation matrix, because the x 's are normalized. Therefore, highly correlated covariates tend to have similar regression coefficients, with the degree of similarity depending on both \| y \|_1 and \lambda_2, which is different from lasso. This phenomenon, in which strongly correlated covariates have similar regression coefficients, is referred to as the grouping effect. Grouping is desirable since, in applications such as tying genes to a disease, finding all the associated covariates is preferable to selecting one from each set of correlated covariates, as lasso often does. In addition, selecting only one from each group typically results in increased prediction error, since the model is less robust (which is why ridge regression often outperforms lasso).
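The grouping effect can be illustrated with a small sketch using scikit-learn's Lasso and ElasticNet on data with two identical covariates; the data and penalty settings are assumptions of the example, and the exact split the lasso solver produces between the twin columns is arbitrary.

```python
# Grouping effect: with two identical covariates, the elastic net assigns them
# (nearly) equal coefficients, whereas the lasso's split between them is arbitrary.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(3)
N = 200
x1 = rng.standard_normal(N)
X = np.column_stack([x1, x1, rng.standard_normal(N)])   # columns 0 and 1 are identical
y = 2.0 * x1 + rng.standard_normal(N)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # mixes L1 and L2 penalties

print("lasso:      ", lasso.coef_)   # weight typically concentrated on one twin column
print("elastic net:", enet.coef_)    # twin columns receive (nearly) equal weight
```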


Group lasso

In 2006, Yuan and Lin introduced the group lasso to allow predefined groups of covariates to jointly be selected into or out of a model. This is useful in many settings, perhaps most obviously when a categorical variable is coded as a collection of binary covariates. In this case, group lasso can ensure that all the variables encoding the categorical covariate are included or excluded together. Another setting in which grouping is natural is in biological studies. Since genes and proteins often lie in known pathways, which pathways are related to an outcome may be more significant than whether individual genes are. The objective function for the group lasso is a natural generalization of the standard lasso objective

: \min_{\beta_0,\beta} \left\{ \frac{1}{N} \left\| y - \beta_0 \mathbf{1}_N - \sum_{j=1}^J X_j \beta_j \right\|_2^2 + \lambda \sum_{j=1}^J \| \beta_j \|_{K_j} \right\}, \qquad \| z \|_{K_j} = ( z^T K_j z )^{1/2},

where the design matrix X and covariate vector \beta have been replaced by a collection of design matrices X_j and covariate vectors \beta_j, one for each of the J groups. Additionally, the penalty term is now a sum over \ell^2 norms defined by the positive definite matrices K_j. If each covariate is in its own group and K_j = I, then this reduces to the standard lasso, while if there is only a single group and K_1 = I, it reduces to ridge regression. Since the penalty reduces to an \ell^2 norm on the subspaces defined by each group, it cannot select out only some of the covariates from a group, just as ridge regression cannot. However, because the penalty is the sum over the different subspace norms, as in the standard lasso, the constraint has some non-differentiable points, which correspond to some subspaces being identically zero. Therefore, it can set the coefficient vectors corresponding to some subspaces to zero, while only shrinking others. However, it is possible to extend the group lasso to the so-called sparse group lasso, which can select individual covariates within a group, by adding an additional \ell^1 penalty to each group subspace. Another extension, group lasso with overlap, allows covariates to be shared across groups, e.g., if a gene were to occur in two pathways.
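The group-wise selection behaviour can be seen in the proximal operator used by proximal-gradient algorithms for the group lasso. The sketch below assumes the simplest setting K_j = I and uses made-up coefficients and groups; it is an illustration of block soft thresholding, not a complete group lasso solver.

```python
# Block soft thresholding, the group-wise proximal operator for the group lasso
# with K_j = I: each group's coefficient block is shrunk towards zero as a unit
# and set exactly to zero if its norm is small enough.
import numpy as np

def group_soft_threshold(beta, groups, thr):
    """Scale each group by max(0, 1 - thr / ||beta_g||_2)."""
    out = beta.copy()
    for g in groups:                       # each g is a list of coefficient indices
        norm = np.linalg.norm(beta[g])
        out[g] = 0.0 if norm == 0 else beta[g] * max(0.0, 1 - thr / norm)
    return out

beta = np.array([0.9, -0.4, 0.05, -0.03, 2.0])
groups = [[0, 1], [2, 3], [4]]
print(group_soft_threshold(beta, groups, thr=0.5))
# first group shrunk jointly, second group zeroed out entirely, third lightly shrunk
```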


Fused lasso

In some cases, the phenomenon under study may have important spatial or temporal structure that must be considered during analysis, such as time series or image-based data. In 2005, Tibshirani and colleagues introduced the fused lasso to extend the use of lasso to this type of data. The fused lasso objective function is

: \min_\beta \left\{ \frac{1}{N} \sum_{i=1}^N ( y_i - x_i^T \beta )^2 \right\} \text{ subject to } \sum_{j=1}^p | \beta_j | \leq t_1 \text{ and } \sum_{j=2}^p | \beta_j - \beta_{j-1} | \leq t_2.

The first constraint is the lasso constraint, while the second directly penalizes large changes with respect to the temporal or spatial structure, which forces the coefficients to vary smoothly to reflect the system's underlying logic.

Clustered lasso is a generalization of fused lasso that identifies and groups relevant covariates based on their effects (coefficients). The basic idea is to penalize the differences between the coefficients so that nonzero ones cluster. This can be modeled using the following regularization:

: \sum_{i < j} | \beta_i - \beta_j | \leq t_2.

In contrast, variables can be clustered into highly correlated groups, and then a single representative covariate can be extracted from each cluster. Algorithms exist that solve the fused lasso problem, and some generalizations of it; some can solve it exactly in a finite number of operations.
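Because the fused lasso objective is convex, it can be handed to a general-purpose convex solver. The sketch below uses the CVXPY package to solve a Lagrangian version of the problem; the data, penalty weights, and piecewise-constant true coefficients are assumptions of the example, not prescribed by the method.

```python
# Solving a Lagrangian form of the fused lasso with the convex solver CVXPY.
# Requires the cvxpy package; the data and penalty weights are illustrative.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
N, p = 80, 20
X = rng.standard_normal((N, p))
beta_true = np.zeros(p)
beta_true[5:12] = 2.0                       # a piecewise-constant coefficient profile
y = X @ beta_true + 0.5 * rng.standard_normal(N)

beta = cp.Variable(p)
lam1, lam2 = 1.0, 5.0                       # weights for sparsity and smoothness
objective = (cp.sum_squares(y - X @ beta)
             + lam1 * cp.norm1(beta)                 # lasso penalty
             + lam2 * cp.norm1(cp.diff(beta)))       # fused penalty on adjacent differences
cp.Problem(cp.Minimize(objective)).solve()

print(np.round(beta.value, 2))              # estimates are sparse and piecewise constant
```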


Quasi-norms and bridge regression

Lasso, elastic net, group and fused lasso construct the penalty functions from the \ell^1 and \ell^2 norms (with weights, if necessary). Bridge regression utilises general \ell^p norms ( p \geq 1 ) and quasinorms ( 0 < p < 1 ) (Fu, Wenjiang J., "The Bridge versus the Lasso", Journal of Computational and Graphical Statistics 7 (3), 1998: 397–416). For example, for ''p'' = 1/2 the analogue of the lasso objective in the Lagrangian form is to solve

: \min_{\beta_0,\beta} \left\{ \frac{1}{N} \left\| y - \beta_0 \mathbf{1}_N - X \beta \right\|_2^2 + \lambda \| \beta \|_{1/2} \right\}, \qquad \text{where } \| \beta \|_{1/2} = \left( \sum_{j=1}^p \sqrt{ | \beta_j | } \right)^2.

It is claimed that the fractional quasi-norms \ell^p ( 0 < p < 1 ) provide more meaningful results in data analysis both theoretically and empirically. The non-convexity of these quasi-norms complicates the optimization problem. To solve this problem, an expectation-minimization procedure has been developed (Gorban, A.N.; Mirkes, E.M.; Zinovyev, A., "Piece-wise quadratic approximations of arbitrary error functions for fast and robust machine learning", Neural Networks 84, 2016: 28–38) and implemented (Mirkes, E.M., PQSQ-regularized-regression repository, GitHub) for minimization of the function

: \min_{\beta_0,\beta} \left\{ \frac{1}{N} \left\| y - \beta_0 \mathbf{1}_N - X \beta \right\|_2^2 + \lambda \sum_{j=1}^p \vartheta( \beta_j^2 ) \right\},

where \vartheta(\gamma) is an arbitrary concave monotonically increasing function (for example, \vartheta(\gamma) = \sqrt{\gamma} gives the lasso penalty and \vartheta(\gamma) = \gamma^{1/4} gives the \ell^{1/2} penalty). The efficient algorithm for minimization is based on piece-wise quadratic approximation of subquadratic growth (PQSQ).


Adaptive lasso

The adaptive lasso was introduced by Zou in 2006 for linear regression and by Zhang and Lu in 2007 for proportional hazards regression (Zhang and Lu, 2007, Biometrika). It replaces the uniform lasso penalty with coefficient-specific weights, typically derived from an initial estimate, so that coefficients with large initial estimates are penalized relatively less.


Prior lasso

The prior lasso was introduced for generalized linear models by Jiang et al. in 2016 to incorporate prior information, such as the importance of certain covariates. In prior lasso, such information is summarized into pseudo responses (called prior responses) \hat{y}^\mathrm{p}, and then an additional criterion function is added to the usual objective function with a lasso penalty. Without loss of generality, in linear regression, the new objective function can be written as

: \min_\beta \left\{ \frac{1}{N} \| y - X \beta \|_2^2 + \frac{\eta}{N} \| \hat{y}^\mathrm{p} - X \beta \|_2^2 + \lambda \| \beta \|_1 \right\},

which is equivalent to

: \min_\beta \left\{ \frac{1}{N} \| \tilde{y} - X \beta \|_2^2 + \frac{\lambda}{1 + \eta} \| \beta \|_1 \right\},

the usual lasso objective function with the responses y being replaced by a weighted average of the observed responses and the prior responses, \tilde{y} = ( y + \eta \hat{y}^\mathrm{p} ) / ( 1 + \eta ) (called the adjusted response values by the prior information).

In prior lasso, the parameter \eta is called a balancing parameter, in that it balances the relative importance of the data and the prior information. In the extreme case of \eta = 0, prior lasso is reduced to lasso. If \eta = \infty, prior lasso relies solely on the prior information to fit the model. Furthermore, the balancing parameter \eta has another appealing interpretation: it controls the variance of \beta in its prior distribution from a Bayesian viewpoint. Prior lasso is more efficient in parameter estimation and prediction (with a smaller estimation error and prediction error) when the prior information is of high quality, and it is robust to low quality prior information given a good choice of the balancing parameter \eta.
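The adjusted-response view of prior lasso lends itself to a very short sketch: blend the observed and prior responses, then run an ordinary lasso on the blended responses. The data, the made-up prior coefficients, and the values of \eta and the penalty below are assumptions of the example.

```python
# A minimal sketch of the adjusted-response view of prior lasso:
# blend observed and prior responses, then fit an ordinary lasso.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
N, p = 150, 10
X = rng.standard_normal((N, p))
beta_true = np.array([2.0, 0, 0, 1.0, 0, 0, 0, 0, 0, 0])
y = X @ beta_true + rng.standard_normal(N)

# Hypothetical prior responses, e.g. predictions implied by previously published coefficients.
beta_prior = np.array([1.5, 0, 0, 0.8, 0, 0, 0, 0, 0, 0])
y_prior = X @ beta_prior

eta = 1.0                                   # balancing parameter: weight on the prior information
y_tilde = (y + eta * y_prior) / (1 + eta)   # adjusted responses

model = Lasso(alpha=0.05).fit(X, y_tilde)
print(model.coef_)
```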


Computing lasso solutions

The loss function of the lasso is not differentiable, but a wide variety of techniques from convex analysis and optimization theory have been developed to compute the solution path of the lasso. These include coordinate descent (Friedman, Jerome, Trevor Hastie, and Robert Tibshirani, "Regularization Paths for Generalized Linear Models via Coordinate Descent", Journal of Statistical Software 33 (1), 2010: 1–21, https://www.jstatsoft.org/article/view/v033i01/v33i01.pdf), subgradient methods, least-angle regression (LARS), and proximal gradient methods (Efron, Bradley, Trevor Hastie, Iain Johnstone, and Robert Tibshirani, "Least Angle Regression", The Annals of Statistics 32 (2), 2004: 407–451, https://www.jstor.org/stable/3448465). Subgradient methods are the natural generalization of traditional methods such as gradient descent and stochastic gradient descent to the case in which the objective function is not differentiable at all points. LARS is a method that is closely tied to lasso models, and in many cases allows them to be fit efficiently, though it may not perform well in all circumstances; it generates complete solution paths. Proximal methods have become popular because of their flexibility and performance and are an area of active research. The choice of method depends on the particular lasso variant, the data, and the available resources; however, proximal methods generally perform well.
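As a compact illustration of the proximal gradient approach, the sketch below implements ISTA (iterative shrinkage-thresholding) for the Lagrangian lasso objective (1/N)\|y - X\beta\|_2^2 + \lambda\|\beta\|_1; it alternates a gradient step on the smooth squared-error term with a soft-thresholding step. The step size, iteration count, and synthetic data are illustrative choices, and a production solver would add a convergence check or acceleration.

```python
# A compact sketch of proximal gradient descent (ISTA) for the Lagrangian lasso.
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    N, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 / N)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = -2.0 / N * X.T @ (y - X @ beta)          # gradient of the squared-error term
        z = beta - step * grad                          # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding (proximal) step
    return beta

rng = np.random.default_rng(6)
X = rng.standard_normal((100, 8))
beta_true = np.array([1.5, 0, 0, -2.0, 0, 0, 0, 0])
y = X @ beta_true + 0.3 * rng.standard_normal(100)
print(np.round(ista_lasso(X, y, lam=0.1), 3))           # sparse estimate close to beta_true
```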


Choice of regularization parameter

Choosing the regularization parameter ( \lambda ) is a fundamental part of lasso. A good value is essential to the performance of lasso since it controls the strength of shrinkage and variable selection, which, in moderation, can improve both prediction accuracy and interpretability. However, if the regularization becomes too strong, important variables may be omitted and coefficients may be shrunk excessively, which can harm both predictive capacity and inference. Cross-validation is often used to find the regularization parameter.

Information criteria such as the Bayesian information criterion (BIC) and the Akaike information criterion (AIC) might be preferable to cross-validation, because they are faster to compute and their performance is less volatile in small samples. An information criterion selects the estimator's regularization parameter by maximizing a model's in-sample accuracy while penalizing its effective number of parameters/degrees of freedom. Zou et al. proposed to measure the effective degrees of freedom by counting the number of parameters that deviate from zero. The degrees of freedom approach was considered flawed by Kaufman and Rosset and by Janson et al., because a model's degrees of freedom might increase even when it is penalized harder by the regularization parameter. As an alternative, the relative simplicity measure defined above can be used to count the effective number of parameters. For the lasso, this measure is given by

: \sum_{i=1}^p \frac{ | \hat{\beta}_i - \beta_{0,i} | }{ | b_{\text{OLS},i} - \beta_{0,i} | },

which monotonically increases from zero to p as the regularization parameter decreases from \infty to zero.
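A brief sketch of the cross-validation route, using scikit-learn's LassoCV to search a grid of penalties and keep the value with the best held-out performance, is shown below; the synthetic data and the number of folds are assumptions of the example.

```python
# Choosing lambda by cross-validation with scikit-learn's LassoCV. Illustrative data only.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
N, p = 200, 20
X = rng.standard_normal((N, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.5, -2.0, 1.0]
y = X @ beta_true + rng.standard_normal(N)

model = LassoCV(cv=5).fit(X, y)        # 5-fold cross-validation over an automatic alpha grid
print("selected alpha:", model.alpha_)
print("nonzero coefficients:", np.flatnonzero(model.coef_))
```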


Selected applications

LASSO has been applied in economics and finance, where it was found to improve prediction and to select sometimes neglected variables, for example in the corporate bankruptcy prediction literature and in the prediction of high-growth firms.


See also

* Least absolute deviations
* Model selection
* Nonparametric regression
* Tikhonov regularization

